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Abstract — The influence of DNA c/5-regulatory elements on a 
gene's expression has been intensively studied. However, little is 
known about expressions driven by trans -SLCting DNA hotspots. 
DNA hotspots harboring copy number aberrations are rec- 
ognized to be important in cancer as they influence multiple 
genes on a global scale. The challenge in detecting trans -effects 
is mainly due to the computational difliculty in detecting weak 
and sparse trans-SLCting signals amidst co-occuring passenger 
events. We propose an integrative approach to learn a sparse 
interaction network of DNA copy-number regions with their 
downstream targets in a breast cancer dataset. Information 
from this network helps distinguish copy-number driven from 
copy-number independent expression changes on a global scale. 
Our result further delineates cis- and trans-effects in a breast 
cancer dataset, for which important oncogenes such as ESRl 
and ERBB2 appear to be highly copy-number dependent. 
Further, our model is shown to be eflicient and in terms of 
goodness of fit no worse than other state-of the art predictors 
and network reconstruction models using both simulated and 
real data. 

I. Introduction 

Copy number alterations, including both germline variants 
(CNYs) and somatic copy number aberrations (CNAs) are 
associated with disease |[T|. Somatic aberrations (CNAs) 
are particularly important for tumourigenesis, contributing 
to genomic instability and gene deregulation. For example, 
oncogene activation by gene amplification or tumour sup- 
pressor loss as a result of gene deletion (Fig[T]) can cause 
transcriptional changes and contribute to the pathogenesis 
of cancer. On the other hand, expression can be influenced 
by the proximal genes within a several Mb window (cis- 
acting), but can also be afl'ected by a remote CNA throughout 
the genome (^ra^^'-acting), as depicted in Fig[T] One of 
the challenges in cancer genomics is to characterize such 
intermediate phenotypic changes caused by both cis- and 
trans- CNAs that ultimately lead to tumour progression. 

Detecting cis- and trans -regulatory effects: An inte- 
grative analysis of CNA and expression has the potential 
to uncover regulatory relationships between CNA and gene 
expression (Fig[T]). The availability of genome- wide copy 
number and gene expression data facilitates the discovery 
of such regulatory maps, but few studies have quantitatively 
assayed the efl'ect of both cis- and trans- copy number 
on the expression abundance in the same primary tumour 
sample. Given the inherent noise in such data and the 
large-scale of genomic information, the detection of both 
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Figure 1. Expression abundance level can be affected by both cis- 
and trans- acting copy number alterations, integrating gene expression 
data and gene dosage information measured by microarray or sequencing 
technologies. 

cis- and trans-SLCting efl'ects between DNA copy number 
and mRNA on a genome-wide basis is a difficult task. 
Such eff'ects are small and infrequent |2|, which further 
imposes computational challenges. Detection of the driving 
alterations amidst numerous random passenger events is 
complicated by the frequency of co-occurring events such 
as co-amplifications and co-deletions. Thus despite consid- 
erable eff'orts, the location of many relevant CNAs that result 
in gene disregulation remain elusive. 

Previous research: This area of research is of key 
importance since many somatic alterations could contribute 
to tumorigenesis and their identities are critical for the 
development of specific anti-cancer agents. Not surprisingly, 
the association between DNA copy number and mRNA 
expression data has gained considerable attention. One of 
the first papers to combine gene expression and DNA copy 
number data by 1 1 1 concluded that 62% of highly amplified 
genes demonstrated moderate to high expression levels. 
Others have similarly identified genes that exhibit gain/loss 
and simultaneous over/under-expression in cancer |3|, but 
it has been noted that chromosomal amplifications do not 
necessarily result in the over-expression of the genes located 
in the same cytoband |4|. 

In summary, recent studies have focused on the local 
eff'ects of copy number on expression, and few have consid- 



ered trans-SiCtmg effects on a global scale. Correlation analy- 
sis remains in widespread use for these types of comparisons 
due to its efficiency |5 |, and has more recently been used to 
query ^ra^^-effects on a genome-wide level |6|. Several more 
sophisticated tools have also been proposed. For example, 
Q made use of gene sets instead of individual genes in 
regression models in their efforts to integrate the two data 
types. However, simple regression has limited power in face 
of the immense amount of genomic data. Moreover, their 
focus is on CIS- events and therefore cannot incorporate 
remote interactions. 

Our approach: In order to find copy-number driven 
expression, we propose a new framework for reconstructing 
a genome-wide regulatory map on a genomic and tran- 
scriptomic level. The objective of the proposed framework 
is to set up an interaction network between CNA and 
expression changes (arrows in Fig[T]). By interpreting global 
gene expression changes in the context of CNA, the network 
has the potential to uncover both cis- and trans-rcgulsitory 
elements. Our approach combines several key ideas: First, 
we employ an Li -regularized regression model which al- 
lows simultaneous feature selection and model fitting which 
generalizes well in high-dimensions. Second, we introduce 
a combination of global and local search strategies to opti- 
mally set the regularization parameter in our model. Third, 
we quantitatively measure each transcript's copy-number 
dependence. 

II. Method 

We assume copy number data of p DNA regions X = 
(xi,X2, ...,Xp) and mRNA expression data of q genes Y = 
(yi^yi, --^yq), both available from n tumour samples. 

The framework for investigating CN trans- effect is built 
on a linear regression method with Li -regularization (sec- 
tion |II-A| ). The strength of regularization is chosen adap- 
tively for each gene by a combined global and local search 
over the space of regularization parameters (Section II-B[ ). 
The results are incorporated into a quantitative measure 
for copy-number dependence by comparing variances of 
the model residuals with and without a locus of interest 
(Section |IFCl ). 

A. Li-constrained regression selects most influential ge- 
nomic loci for each transcript 

We use Li constrained regression (a.k.a the Lasso |8|) 
to derive the causal effects of genome-wide copy number 
data, as the predictor variables, on gene expression profiles, 
as the response variables. L - 1 -regression minimizes the 
sum of squared errors between response and prediction, 
while keeping a bound on the sum of the absolute values of 
the regression coefficients. By penalizing the absolute sum 
of coefficients, Li -constrained regression effectively shrinks 
non- significant coefficients to zero. In contrast to this, L2- 
constrained regression (aka ridge regression) keeps a bound 
on the sum of the squared values of the regression coeffi- 
cients and generally results in small, but non-zero, regression 
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Figure 2. The Lasso path along a range of X values for MYC, a well 
known oncogene, in the experimental dataset. Each line corresponds to a 
regression coefficient measuring the influence of copy number variation at 
a particular locus (names on the right) on MYC expression. Regularization 
strength decreases with A from the left to the right. This can be seen by 
more and more coefficients hitting the zero line and vanishing from the 
model. 



coefficients. Hence, Li -regression is a prominent method 
in high-dimensional studies for its efficiency in performing 
sparse statistical inference on thousands of variables |[9|. 
Efficient computational methods exist to infer the entire path 
of variable inclusions over different choices of regularization 
strength |10| . 

In our scenario:: For gene / with expression profile j/, 
let XPi be the candidate prediction model with regression 
coefficients Pi = (J3ii,Pi2, ...,/3in). The Lasso model is then 
given by 



= argmin || ji-XPi 



ll2+^/IIAII 



(1) 



where ||x||2 = Z -^^ indicates the L2 norm measuring the 
distance between model prediction, and response and ||x||i = 
2 \xi\ indicates the Li norm used to constrain the regression 
coefficients. 

The Xi in Eq.Q is a regularization parameter controlling 
the sparsity and strength of regularization. By varying 
Li -regression/Lasso adaptively includes predictors in the 
model (see Figure |2]). The subscript "/" emphasizes that the 
optimization problem in Eq.Q has to be solved for every 
gene. In particular, for each gene a regularization parameter 
Ai has to be chosen. In the next section, we propose to utilize 
an efficient combination of a global and local search strategy. 

B. Adaptive regularization by combined global and local 
search 

The regularization parameter A is crucial since it directly 
determines how many predictor vectors are to be included 
into the model. We perform a two-step procedure for A 
selection that combines a global search over a discrete set 
of candidate values with a local search around the optimal 
candidate. For the global search the likelihood function is 



evaluated by cross-validation, for the local search Brent's 
minimization without derivatives jTTJ implemented in R 
package "penalized" 1 12] is used. Brent's method combines 
steps of golden section (a fast method based on binary 
divisions of the search space) with parabolic interpolation 
to efficiently zoom in onto the maximum. 

Step 1 Global search: We perform 10-fold cross-validation 
across a wide range of candidate A values (grey 
lines in Fig[3|. Log-likelihood scores (black dots) are 
computed for each of these values to provide a point 
of focus for the next step. 

Step 2 Local search: At the point of the maximum log- 
likelihood of Step 1 (black line in Fig [3]), we perform 
a local search (grey box) with Brent's optimization to 
locate the optimal value of A (dashed line). 




Figure 3. A selection procedure base on the log-likelihood function. The 
grey lines mark the global search points, and the two black lines (one solid, 
one dashed) mark the optimal selection points of A by Step 1 and 2. 

Having chosen an optimal A value and computing the 
Lasso solution of Eq.([T]) for each gene, we obtain a co- 
efficient matrix B = (J3i,^2, • • • ,Pn)- Because of the Li- 
constraint in the Lasso, this matrix will be sparsely populated 
and many coefficients will be zero. Matrix B represents the 
regulatory relationships between copy number changes and 
transcriptional changes that was indicated by arrows in Fig[T] 
A technical note: Lasso assumes the predictor variables 
to be linearly independent. However, DNA copy number data 
obtained from high resolution microarrays can exhibit high 
correlation between probes. Therefore, copy number data 
should be first tested for independence and, if necessary, the 
data should be merged with an appropriate method such as 
those implemented in R package "CGHregion" |13 | to pro- 
duce regions with independent signatures. We implemented 
our framework in R with the packages "penalized" and 
"CGHregions". 

C. A quantitative measure of copy-number dependence 

The proposed framework allows us to quantitatively mea- 
sure the dependency between gene expression and copy 
number, both in cis and in trans. The underlying assumption 
is that the predictive power of a model can be measured by 
the proportion of variance explained, i.e., the ratio of the 



variance of modelling residuals with or without incorporat- 
ing one or more predictors. More formally, given two models 
-one with and one without incorporating a set of predictors- 
the negative natural logarithm of the variance ratio between 
the prediction residuals. 
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can be used as a score indicating the dependency of the 
response variable on the predictor(s). Therefore, by checking 
if the variance of residuals is reduced after incorporating 
the copy-number predictors, the dependency of expression 
responses on copy number can be quantitatively measured. 

Furthermore, comparisons can be conducted with copy- 
number cis- and trans-, indicating whether a gene is signif- 
icantly influenced by the trans- efl'ect. Formally, we do this 
by comparing a sequence of nested models that successively 
incorporate more and more copy number predictors: from 
no predictor at all, to using only copy number changes in 
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Figure 4. Performance of three methods in the experiments of network 
structure learning, data is simulated with varied number of samples n and 
number of predictor variables p and number of response variables q,p = q. 



CIS, and finally to using both cis- and trans- copy number 
changes. 



= yi 



(3) 
(4) 
(5) 



Eq.([2]) then allows us to estimate the eff'ect of adding more 
predictors by measuring the diff'erence of variance explained 
between any two equations in Eq.([3])-([5|. For example, the 
dependency score of expression of a gene /' on copy number 
can be calculated based on Eq.Q and Eq.Q 



score 



both 



= -In 



(6) 



And the dependency score of expression yi on the copy 
number rra^^-eff'ect conditioned on the cis- eff'ect based on 
Eq.Q and Eq.([5| is 



/cr2 ^ 



score 



= -In 



(7) 



If the score is significant, there is likely to be a copy number 
rra^^-eff'ect on gene /. Note that by incorporating the cis- 
predictor Xi in both equation, we remove the cis-effect before 
exploring the ^ra^^-eff'ect Y^jj^i xj, which is equivalent to a 
conditional independence test. 

III. Experiments on simulated networks 

Simulated data was generated to test the performance of 
the Li method on network structure inference. Let X ~ 
A^(0, 1) be a matrix with p Gaussian variables each of n 
samples, and T be a sparse matrix of p hy q generated 
from randomly sampling 50 non-zero coefficients from a 
distribution Nifd, 1). From X and T, a response vector 
Y = XT with q variables is generated. Both X and Y are 
added with centered Gaussian noise N(0, cr^). The objective 
is to infer T from X and Y. 

Two popular network reconstruction methods in the liter- 
ature that are applicable to the small n, large p problem are 
compared with the Li method. They are the Gaussian Graph- 
ical Models 1 14] and Partial Least Square fTSl. However, to 
remove any ambiguity that could be caused by the diff'erent 
numbers of inferred relations, the number of relations, i.e. 
total number of non-zero coeflScients in T are known to all 
methods. 

The performance of the methods in comparison is judged 
by whether the non-zero coefficients in T can be correctly 
detected. Data is simulated with p, the number of variables 
in X, varying from 100 to 2000 and sample size n varying 
from 50 to 500. The number q of variables in Y is equal to 
q. For each setting of p and datasets are generated with 
diff'erent noise levels cr^ e {0.05,0.1,0.2}. For the result, 
precision and recall statistics for the inference results of 



all variables in Y are computed and averaged across noise 
levels. 

The precision and recall curves, grouped by values of p, 
are plotted in Fig|4] The advantage of the Li method can be 
clearly seen: in all scenarios precision and recall are higher 
than for the competitors. The diff'erence is most pronounced 
for large values of p, which is especially important for 
real- world applications. In summary the simulations show 
that the proposed method outperforms other state-of-the-art 
methods in its accuracy to reconstruct regulatory networks. 

IV. A GENOME- WIDE MAP OF CIS- AND ^m^i'-REGULATION IN 
BREAST CANCER 

To demonstrate the eff'ectiveness of the proposed method 
in a real-world application we used a breast cancer dataset 
with 89 samples analyzed using both mRNA expression and 
DNA copy number microarrays [ ,16J . 

A. A quantitative assessment of prediction accuracy 

While the simulations showed the accuracy of our frame- 
work in network reconstruction compared to other network 
models, on real data we can not repeat this analysis since 
the true network is unknown. However, a complementary 
question that we can answer is how well our model predicts 
gene expression from copy-number alterations compared to 
other state-of-the-art regression models. We compare Li- 
constrained regression with three other well known predic- 
tion models: Random Forests 1 17], SVM 1 18], and Recursive 
Partitioning |fT9l. Each method uses copy number data to 
predict expression data and accuracy is measured in terms 
of the mean squared error of the prediction. 

Models were trained with a subset of samples in the |fT6| 
data and then the fitted models were used to predict the rest 
of samples. In detail: 100 expression profiles are randomly 
selected from the Chin data. For each of the expression 
profile, models were fitted with 7/8 (78) of the samples 
and then the mean squared errors of predicting the test set 
1/8 (11) of the samples. The Random Forests method was 
trained with 500 trees so that the computational time is on 
the same scale as that of other methods. SVM was trained 
with a linear kernel. 

Results in Fig |5] show that the Li -constrained regression 
is not out-performed by any competitor. Additionally, it pro- 
vides a sparser solution than the other models: On average 
Li -regression used less than 7 features, while Recursive 
Partitioning used more than 32 and SVM and Random 
Forests use all of them. Thus, the Li -model combines 
sparsity with high prediction accuracy. This is important 
since the selected features directly point to the relevant 
copy number alterations leading to expression changes. This 
experiment proves the proposed framework's potential with 
applications on copy number and expression data. 
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Figure 5. Mean squared errors of predictions by four methods on Chin's 
dataset, the dashed Hne shows the median of mean squared errors of Lasso. 
The average of the mean squared errors for four methods are 0.526, 0.528, 
0.635, 0.707. 



B. The global impact of copy-number change on gene 
expression 

The proposed framework was used to study the impact of 
copy-number aberrations on every gene expression profile 
in the Chin dataset. Using Eq.(|6]) we can rank genes based 
on the copy-number dependent score, as shown in Table [l| 
Several known oncogenes are present such as ERBB2 and 
its neighbors, STARDl and MGC9753, as well as ESRL 
For example, ERBB2 is amplified in approximately 20% 
of the breast cancer cases for which concomitant over- 
expression is also typically observed. Many other genes with 
less well characterized functions were also identified and 
merit additional investigation, e.g. TFFl which is a stable 
secretory protein expressed in the gastric mucosa. 

Table I 

Top 20 genes ranked by the copy-number dependent score 



Gene 



TFFl 
NATl 
TFF3 
AGR2 

GABRP 
MGC9753 
ESRl 

S100A7 
PIP 

HNF3A 



Chr 



21 
8 

21 
7 
5 

17 
6 
1 
7 
14 



Score 



3.31 
2.73 
2.63 
2.61 
2.53 
2.5 
2.49 
2.44 
2.39 
2.38 



Gene 



MGC9753 
PROMLl 
CEACAM6 
SCGB2A2 
CPBl 
MLPH 
GRB7 
SFRPl 
ERBB2 
CEACAM5 



Chr 



17 
4 
19 
11 
3 
2 
17 
8 

17 
19 



Score 



2.38 
2.33 
2.31 
2.31 
2.27 
2.26 
2.26 
2.26 
2.24 
2.21 



Expressions driven by the same CNAs region are also 
enriched with cancer- specific functions. We test whether 
the downstream targets of these DNA regions share similar 
functions or motifs, since genes with similar functions are 
often co-regulated. Gene Ontology (GO) pQ| enrichment 
analysis was performed on the downstream targets for CNAs 
regions driving many expressions. Numerous GO terms 
with low values, indicative of a high degree of func- 
tional similarity, were identified. For example, the targets of 
22q 13.33 are significantly enriched for GO terms involving 
the immune response (/?- value 6E-11), lymphocyte activation 
(9E-11), and leukocyte activation (5E-10), with the former 
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Figure 6. Example 1: MYC. The upper plot shows the g-value (a multiple 
testing corrected p-value) of correlation coefficients between copy number 
and gene expression (red for positive correlations, blue for negative correla- 
tions). The lower plot shows the much sparser results from our framework, 
which illuminates key alterations by removing noise in the data. 



two pertaining to the adaptive immune response. 

Genetic variation at 8q24 has been associated with both 
breast and prostate cancer |[2T| and this region is frequently 
amplified in breast cancer. One of the top ranking hot spots, 
8q24.13, has downstream targets enriched for GO terms such 
as the positive regulation of ubiquitin-protein ligase activ- 
ity during mitotic cell cycle, and the proteasomal protein 
catabolic process. Promoter analysis revealed a significantly 
enriched unknown motif CGGCGACATA (hypergeometric 
test value 4.7E-19, FDR on 50 random samples value 
0.0044) in the upstream (-2000 bp) region. Another hot spot, 
9q22.2, exhibits trans-cffccts on T0P2A (17q21) and the 
MCM family, and has downstream targets enriched for DNA 
replication and mitotic cell cycle functions. Once again, 
these findings are in-line with existing knowledge on cancer 
biology amassed using various approaches. 

C. Individual example of copy-number driven expression 

We also examined the impact of copy number on individ- 
ual target expressions. For the purpose of comparison, cor- 
relation of gene expression and genome- wide copy number 
was performed, and the results after FDR (False Discovery 
Rate) correction under the Benjamini-Hochberg procedure 
| [22| is plotted in Fig [6] 

Fig|6ja) shows that there is very high correlation between 
the expression of the oncogene MYC and copy number 
3q26.2, 5ql2-5q34. In fact, these regions exhibit nearly as 
high a correlation as does 8q24 (which denotes the location 
of MYC). Here, the red histogram indicates positive correla- 
tion, while the blue histogram indicates negative correlation. 
Fig|6];b) illustrates that the dependence of MYC expression 



on copy number in 5q21-5q34, when conditioned on its own 
copy number (8q24), is much smaller than that of copy 
number in 8q24, and the dependence on 3q26 vanishes. 
This is due to the fact that co-occurring events, such as 
co-amplification, have been removed by the conditional 
dependence assumption in Lasso. Moreover, the eff'ects of 
copy number in 9q22.2 and llpll.l2 amongst others, shown 
in Fig|6ja) are completely removed when conditioning on 
the copy number at 8q24. In contrast, an apparently strong 
negative regulatory eff'ect from 17q21.31, which harbours 
WNK4 and the breast cancer susceptibility gene BRCAl 
persists, suggesting an interaction between these key genes. 

V. Discussion 

This paper presents an efficient framework for integrat- 
ing genome- wide copy number and expression data. Both 
cis- and trans- regulatory eff'ects from the genomic level 
to the transcriptomic level can be quantitatively measured 
using conditional probabilities. In particular, Li -constrained 
regression is able to select relevant predictor variables to 
be included into the statistical model. In summary, our 
proposed framework not only allows for a genome-wide 
description of the eff'ect of gene dosage on gene expres- 
sion, but also can partition copy-number dependent and 
independent transcriptional changes. While the former are 
useful to identify copy number and expression events co- 
ordinatedly deregulated during cancer progression, the latter 
are interesting as a starting point for validating putative 
oncogenes. Thus, our framework not only yields a global 
view of the interactions and suggests downstream targets of 
CNAs ripe for validation, it can also be used as a starting 
point for further focused analyses of the genomic basis of 
cancer. Our findings in the Chin dataset are currently being 
validated in another much larger breast cancer dataset, which 
has not yet been published. 
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